A Framework for Hierarchical Cost-sensitive Web Resource Acquisition
Abstract
Many record matching problems involve insufficient or incomplete information, so solutions that classify which pairs of records match often acquire additional information at some cost. For example, web resources impose extra query or download time. Because the number of resources that could be acquired is large, solutions invariably acquire only a subset, balancing acquisition cost against benefit. At the same time, resources often have hierarchical dependencies among themselves; e.g., the search engine results for two queries must be obtained before the TF-IDF cosine similarity between their snippets can be computed. We propose a framework for cost-sensitive acquisition of resources with hierarchical dependencies and apply it to the web resource context. Our framework is versatile and applicable to a wide variety of problems. We show that many problems involving selective resource acquisition can be formulated using resource dependency graphs. We then solve the resource acquisition problem by casting it as a combinatorial search problem. Because the support vector machine is commonly used to solve record matching problems effectively, we also propose a benefit function that works with this classifier. Finally, we demonstrate the effectiveness of our acquisition framework on record matching problems.
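As a rough illustration of the idea of a resource dependency graph searched combinatorially, the following minimal sketch enumerates dependency-closed subsets of resources under a cost budget and keeps the one with the highest net value. It is not the authors' algorithm: the resource names, costs, benefit values, the additive `net_value` objective, and the exhaustive search are all assumptions made for illustration, and the fixed benefit numbers stand in for the SVM-based benefit function mentioned in the abstract.

```python
from itertools import combinations

# Hypothetical resources: name -> (acquisition cost, prerequisite resources).
# The snippet similarity needs both result lists, mirroring the abstract's example.
RESOURCES = {
    "results_q1": (2.0, ()),
    "results_q2": (2.0, ()),
    "snippet_cosine": (0.5, ("results_q1", "results_q2")),
    "page_download": (5.0, ("results_q1",)),
}

# Hypothetical benefit of holding each resource (placeholder numbers,
# not derived from any classifier).
BENEFIT = {
    "results_q1": 1.0,
    "results_q2": 1.0,
    "snippet_cosine": 6.0,
    "page_download": 3.0,
}


def is_closed(subset):
    """Return True if every resource in subset has all its prerequisites in subset."""
    return all(dep in subset for r in subset for dep in RESOURCES[r][1])


def total_cost(subset):
    """Sum of acquisition costs for the resources in subset."""
    return sum(RESOURCES[r][0] for r in subset)


def net_value(subset):
    """Assumed objective: total benefit minus total acquisition cost."""
    return sum(BENEFIT[r] - RESOURCES[r][0] for r in subset)


def best_acquisition(budget):
    """Exhaustively search dependency-closed subsets whose cost fits the budget."""
    names = sorted(RESOURCES)
    best, best_value = frozenset(), 0.0  # acquiring nothing is always allowed
    for k in range(1, len(names) + 1):
        for combo in combinations(names, k):
            subset = frozenset(combo)
            if total_cost(subset) > budget or not is_closed(subset):
                continue
            value = net_value(subset)
            if value > best_value:
                best, best_value = subset, value
    return best, best_value


if __name__ == "__main__":
    chosen, value = best_acquisition(budget=5.0)
    print(sorted(chosen), value)
```

Exhaustive enumeration is exponential in the number of resources, so it is only workable for toy graphs like this one; it is used here because it makes the dependency constraint and the cost/benefit trade-off easy to see.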
Similar resources
Cost-sensitive Web-based Information Acquisition for Record Matching
In many record matching problems, the input data is either ambiguous or incomplete, making the record matching task difficult. However, for some domains, evidence for record matching decisions is readily available in large quantities on the Web. These resources may be retrieved by making queries to a search engine, making the Web a valuable resource. On the other hand, Web resources are slow t...
An Effective Method for Simultaneously Considering Time-Cost Trade-off and Constraint Resource Scheduling Using Nonlinear Integer Framework
The Critical Path Method (CPM) has been one of the most popular techniques used by construction practitioners for construction project scheduling since the 1950s. Despite its popularity, CPM has a major shortcoming: it schedules based on two impractical assumptions, that the project deadline is not bounded and that resources are unlimited. The analytical competency and computing capability of CPM thus...
An Efficient Resource Allocation for Processing Healthcare Data in the Cloud Computing Environment
Nowadays, processing large-media healthcare data in the cloud has become an effective way of satisfying medical users' QoS (quality of service) demands. Providing healthcare for the community is a complex activity that relies heavily on information processing. Such processing can be very costly for organizations. However, processing healthcare data in the cloud has become an effective s...
Adaptive Information Analysis in Higher Education Institutes
Information integration plays an important role in academic environments since it provides a comprehensive view of education data and enables managers to analyze and evaluate the effectiveness of education processes. However, the problem in traditional information integration is the lack of personalization due to weak information resources or unavailability of analysis functionality. In this ...
Publication date: 2010